Introduction

Background and motivation

Human resources are the most valuable asset in any country¹. They are the main reason behind the success or the failure of any organization. In fact, having an educated and competent manpower is the key driver to economic and social development. In this context, the importance of academic education has become undeniable. Therefore, it is crucial to invest money and time in order to study students’ academic performance and figure out effective ways to improve it.

Given the importance of the topic, it has been given particular attention in past research. In fact, many studies have been conducted in order to analyze the factors impacting students’ academic performance. While some studies focused on the psychological variables, such as Franck Amadieu & André Tricot’s research², other researchers have been interested in the impact of other elements such as mobility ³, gender and other socio-economic factors on students’ academic success.

Many reasons motivated us to choose this topic of research. In fact, as students, we are very passionate about the educational field. Thus, we want to provide through this project a detailed analysis that can be used as a reference guide for leaders working in the educational field. Mainly, we want to help schools and universities to have a better understanding of the factors influencing students’ academic performance in order to improve their decision-making processes, students’ success rate and eventually their overall organization.

Source: ¹ Gestion des ressources humaines,Jean-Marie Peretti, 2004. ² Psychological factors which have an effect on student success ,2015. ³ La migration pour études : Regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants »,2009.

Project objectives

The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on 3rd grade students (equivalent to 11th grade in Switzerland) and their results on the Diplôme National du Brevet (DNB) by school.

First, we will observe whether there are improvements or, on the contrary, deterioration in admissions of DNB over the years. From this dataset, we will also make comparisons, particularly at the geographical level, and an analysis of the success rate in terms of distinction for each school.

Then, we will try to understand if there is a correlation between academic success and some socio-economic factors, such as the type of accommodation, the single-parent families rate, and the involvement of schools in students’ physical and sports practice. Finally, despite these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.

Research questions

  • What is the evolution of student performance over time and across the different regions/departments of France?

  • Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?

  • Has the COVID-19 pandemic impacted student performance?

Data

Import data

Data Wrangling

We have realised that some wrangling are necessary for each data sets. We have established a checklist that we will go through for each data set. We have to :

  1. Translate the column names.
  2. Make sure that all data are of the right type.
  3. Make sure that the time reference (year) are all aligned with the exam session.
  4. Add a column department_fr that with department names matching the ones in ggplot.
  5. Verify if the data set needs further wrangling.

Function

As we will rename all data sets in the same way, we found it useful to create a function. The function has a data frame and a vector as inputs. It checks if the length of the vector is correct, if so it returns a tibble with the column names renamed


rename_df <- function(df, x){
  if (ncol(df) == length(x)){
    names(df) <- c(x)
    df <- as_tibble(df)
  } else {
    stop("Vector is not the right length")
  }
}

dnb_results

This dataset presents the results of the diplôme national du brevet by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.

  1. Translate the column names.

dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)
  1. success_rate is of the form ,% we want it as a double of the form .

dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
                                               gsub(",",".", dnb_results[["success_rate_pct"]])))
  1. We need to harmonize the year variables of all the other data sets to match the logic of this one. The year is the year of the exam session (e.g academic period “2020-2021” is represented as 2021)

  2. We need to add a column department_fr that with department names matching the ones in ggplot.


dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
  1. We want some to know the attribution rate of each mention to simplify the comparison.
dnb_results <- dnb_results %>% 
  mutate(without_pct = admitted_without/admitted*100,
         AB_pct = admitted_AB/admitted*100,
         B_pct = admitted_B/admitted*100,
         TB_pct = admitted_TB/admitted*100
         )

We can see the final table dnb_results below.

establishment_24

This data set gathers all schools which have been awarded the “Generation 2024” label. Its aim is to develop bridges between the school world and the sports movement in order to encourage young people to practice physical activity and sport. This data set contains 6’883 observations.

  1. Translate the column names.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
  1. No problem for this data set
  2. We need to add two variables session_started and session_ended. Indeed as the label has a start and an End date we have to trace the first session and the last session where the establishment have the label generation 2024.Most labellisations start and end in January but a few start and end in middle of the year. Exams take place end of June, beginning of July. Therefore, we will consider labellisation done in August and after as done for the next academic year.

establishment_24 <- establishment_24 %>% 
  mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
                                     month(label_start_date) >  7 ~ year(label_start_date)+1),
         session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
                                   month(label_end_date) >  7 ~ year(label_end_date)+1)
         )
  1. We need to add a column department_fr that with department names matching the ones in ggplot.

establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
  1. We can create a high_school_type variable instead of agricultural_high_school, military_high_school, vocational_high_school and drop those variables.

establishment_24 <- establishment_24 %>% 
  mutate(high_school_type = case_when(agricultural_high_school == 1 ~ "agricultural high school",
                                      military_high_school == 1 ~ "military high school",
                                      vocational_high_school == 1 ~ "vocational high school")) %>% 
    select(region:sport_section, high_school_type, establishment_web:session_ended)

We can see the final table establishment_24 below.

student_housing

This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.

  1. Translate the column names.

housing_names <- c("year_back_to_school", "Academic_region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)
  1. No Problem for this data set

  2. We need to create a session variable as year_back_to_school refers to the beginning of the school year and not the exam session.


student_housing <- student_housing %>% 
  mutate(session = year_back_to_school + 1) %>% 
    select(year_back_to_school,session, everything()) #here just to order variables
  1. We need to add a column department_fr that with department names matching the ones in ggplot.
student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
  1. No need for further data wrangling for this data set

We can see the final table student_housing below.

single_parent

This data set provides information about the single-parent families in each municipality. The census has been made every five years since 2008. This data set contains 104’986 observations

  1. Translate the column names.
sg_parent_names <- c("geocode", "municipality", "year","sing_par")

covid_in_schools

This is a time based data set that gives us information on the COVID tests and results carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily. On the 11th October, the data set contained 543’974 observations.

  1. Translate the column names.

    can try to implement API (further idea)

#1
covide_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(Covid_sp_dep_7j_cage_scol_2022_10_10_19h02,covide_names)
  1. test_week will be treated in (3.). positive, incidence_rate and positivity_rate need to be doubles
covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))
  1. We need to create two new variables. The first will be the date categorizing each week. We chose the first date (Monday). The test for a session will be those from August to July of the next year. As our argument will be set on the month, we might have some test done the first days of august count towards the “wrong” session. The number of Covid cases in August are relatively low compared to the rest of the year and it represents at maximum 6 days of tests. Therefore we consider this margin of error to be satisfactory.
covid_in_schools <- covid_in_schools %>% 
  mutate(test_date = ymd(substr(test_week,1,10)),
         session = case_when(month(test_date) <= 7 ~ year(test_date),
                             month(test_date) >  7 ~ year(test_date)+1))
  1. Only department code. We need to input the names of the departments.

  2. To simplify the dataset, we need to drop all but 11-15 educational_level.

We can see the final table covid_in_schools below.

Auxiliary data sets

We will use the ggplot France map for our visualizations

map <- map_data("france")

The region variable is in fact the departments. A quick renaming is necessary.

colnames(map)[5]<- "department_fr"
  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

Exploratory data analysis

Essai

We are mostly exploring our options and have yet to truly deep dive into this portion of our project.

##National analysis

National DNB statistics

National DNB statistics (rate)

Regional Analysis

Number of admitted by region

Number of students per result and region during the period 2006-2021

Success rate

Admitted with zero mention

Admitted with mention AB

Admitted with mention B

Admitted with mention TB

###{-}

Period 2006-2021 by region

Success rate

Admitted with zero mention

Admitted with mention AB

Admitted with mention B

Admitted with mention TB

###{-}

Departmental Analysis

Box plot Analysis per department

success_rate_pct

without_pct

B_pct

AB_pct

TB_pct

Box plot Analysis of at the Department level

Paris 2020

First analysis of the best performing highest rate of TB -> Paris 2020

Guyane 2006

Lowest performing department in TB rate and “best” in zero mention rate -> Guyane 2006

Establishment_24

esssai carte

Creation of the map theme


map_theme <- theme(title=element_text(),
                   plot.title=element_text(margin=margin(20,20,20,20), size=18, hjust = 0.5),
                   axis.text.x=element_blank(),
                   axis.text.y=element_blank(),
                   axis.ticks=element_blank(),
                   axis.title.x=element_blank(),
                   axis.title.y=element_blank(),
                   panel.grid.major= element_blank(), 
                   panel.background= element_blank()) 

Creation of the dataset used for the map


result <- dnb_results %>% 
  select(department_fr, success_rate_pct) %>% 
  group_by(department_fr) %>% 
  summarise(mean = mean(success_rate_pct))

Join the map from ggplot and our new dataset


result_map <- left_join(x = map[,-6], y = result)

plot

  • Mapping out the underlying structure
  • Identifying the most important variables
  • Univariate visualizations
  • Multivariate visualizations
  • Summary tables

Analysis

regression Essai

lm1 <- lm(dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + dnb_results$AB_pct + dnb_results$without_pct)
summary(lm1)
#> 
#> Call:
#> lm(formula = dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + 
#>     dnb_results$AB_pct + dnb_results$without_pct)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -2.30e-09  0.00e+00  0.00e+00  0.00e+00  1.53e-10 
#> 
#> Coefficients:
#>                          Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept)              1.00e+02   1.53e-13  6.54e+14   <2e-16 ***
#> dnb_results$without_pct -1.00e+00   1.63e-15 -6.12e+14   <2e-16 ***
#> dnb_results$B_pct       -1.00e+00   2.93e-15 -3.41e+14   <2e-16 ***
#> dnb_results$AB_pct      -1.00e+00   1.96e-15 -5.11e+14   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.25e-12 on 139537 degrees of freedom
#>   (39 observations deleted due to missingness)
#> Multiple R-squared:     1,   Adjusted R-squared:     1 
#> F-statistic: 1.79e+29 on 3 and 139537 DF,  p-value: <2e-16

cluster Essai

dnb_pct_dep <- dnb_results %>%
  group_by(department, session) %>% 
  summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
            B_pct_dep = mean(B_pct, na.rm = TRUE),
            TB_pct_dep = mean(TB_pct, na.rm = TRUE),
            without_pct_dep = mean(without_pct, na.rm = TRUE),
            success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))

pairs(dnb_pct_dep[2:6])

distance <- dist(dnb_pct_dep)
#> Warning in dist(dnb_pct_dep): NAs introduced by coercion

mydata.hclust <- hclust(distance)
plot(mydata.hclust)

Conclusion